Smart Paper Notes

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel

Categories: Machine Learning | Artificial Intelligence | Machine Learning (Statistics)

2021-06-15

Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search, where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge, as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations based on heavier-tailed distributions with tail-index parameter alpha, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, and we corroborate that it also manifests as improved performance of policy search, especially when myopic and farsighted incentives are misaligned.
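
To make the role of the tail index concrete, the sketch below compares how often a heavy-tailed policy and a Gaussian policy propose an action far from the policy's center. It is only an illustration under stated assumptions: the Cauchy distribution (a symmetric stable law with tail index alpha = 1) is picked as one example of a heavy-tailed family, the action is scalar, and the function names are hypothetical rather than taken from the paper.

```python
import math
import random


def sample_cauchy_action(mu, scale, rng):
    """Draw one scalar action from a Cauchy policy centered at mu.

    Assumption: Cauchy (tail index alpha = 1) stands in for a generic
    heavy-tailed policy; the abstract does not fix a specific family.
    """
    # Inverse-CDF sampling: a = mu + scale * tan(pi * (u - 1/2)), u ~ U[0, 1).
    u = rng.random()
    return mu + scale * math.tan(math.pi * (u - 0.5))


def cauchy_tail_prob(k):
    """P(|a - mu| > k * scale) for a Cauchy policy (closed form)."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)


def gaussian_tail_prob(k):
    """P(|a - mu| > k * sigma) for a Gaussian policy (closed form)."""
    return math.erfc(k / math.sqrt(2.0))


def empirical_jump_fraction(k, n, seed=0):
    """Fraction of n sampled Cauchy actions farther than k scales from mu."""
    rng = random.Random(seed)
    mu, scale = 0.0, 1.0
    far = sum(
        1 for _ in range(n)
        if abs(sample_cauchy_action(mu, scale, rng) - mu) > k * scale
    )
    return far / n


if __name__ == "__main__":
    k = 5.0
    print("Cauchy   P(|a - mu| > 5 scale):", cauchy_tail_prob(k))
    print("Gaussian P(|a - mu| > 5 sigma):", gaussian_tail_prob(k))
    print("Cauchy empirical fraction     :", empirical_jump_fraction(k, 20000))
```

In this toy setting the Cauchy policy proposes an action more than five scale units from its center with probability of about 0.126, against about 5.7e-7 for the Gaussian, which is one way to read the abstract's statement that heavier tails increase the likelihood of jumps.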

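Dormant pruning of fruit trees is an important task for maintaining tree health and ensuring high-quality fruit. Because labor availability is declining, pruning is a prime candidate for robotic automation. However, pruning is also a uniquely difficult problem for robots: it requires perception, pruning-point determination, and manipulation under variable lighting conditions and in a complex, highly unstructured environment. In this paper, we present a system for pruning sweet cherry trees (in a planar tree architecture known as the upright fruiting offshoot configuration) that integrates various subsystems from our previous work on perception and manipulation. The resulting system is capable of operating completely autonomously and requires minimal control of the environment. We validate the system's performance through field trials in a sweet cherry orchard, ultimately achieving a cutting success rate of 58%. Although it is not fully robust and its throughput needs improvement, our system is the first to operate on fruit trees and represents a useful base platform that can be improved in the future.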
